Abstract

Motivated by data-rich experiments in transcriptional regulation and sensory neuro- 
science, we consider the following general problem in statistical inference. When ex- 
posed to a high-dimensional signal S, a system of interest computes a representation R 
of that signal which is then observed through a noisy measurement M. From a large 
number of signals and measurements, we wish to infer the "filter" that maps S to R. 
However, the standard method for solving such problems, likelihood-based inference,
requires perfect a priori knowledge of the "noise function" mapping R to M. In prac- 
tice such noise functions are usually known only approximately, if at all, and using an 
incorrect noise function will typically bias the inferred filter. Here we show that, in the 
large data limit, this need for a pre-characterized noise function can be circumvented 
by searching for filters that instead maximize the mutual information I[M; R] between 
observed measurements and predicted representations. Moreover, if the correct filter 
lies within the space of filters being explored, maximizing mutual information becomes 
equivalent to simultaneously maximizing every dependence measure that satisfies the 
Data Processing Inequality. It is important to note that maximizing mutual information 
will typically leave a small number of directions in parameter space unconstrained. We 
term these directions "diffeomorphic modes" and present an equation that allows these 
modes to be derived systematically. The presence of diffeomorphic modes reflects a 
fundamental and nontrivial substructure within parameter space, one that is obscured 
by standard likelihood-based inference. 

1 Introduction 

This paper discusses a familiar problem in statistical inference, but focuses on an under- 
studied limit that is becoming increasingly relevant in the era of large data sets. Con- 
sider an experiment having the following form: 



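$$ \underset{\text{signal}}{S} \;\xrightarrow[\text{filter}]{\;\theta(S)\;}\; \underset{\text{representation}}{R} \;\xrightarrow[\text{noise function}]{\;\pi(M|R)\;}\; \underset{\text{measurement}}{M} \qquad (1) $$
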
When presented with a signal S, a system of interest applies a deterministic filter $\theta$,
thereby producing an internal representation R of that signal. For each representation
R, a noisy measurement M is then generated. The conditional probability distribution
$\pi(M|R)$ from which M is drawn is called the "noise function" of the system. From
data consisting of N signal-measurement pairs, $\{S_n, M_n\}_{n=1}^N$, we wish to reconstruct
the filter $\theta$. This paper focuses on how to infer $\theta$ properly in the $N \to \infty$ limit when
the noise function $\pi$ is unknown a priori.

All statistical regression problems have this "SRM" form (Bishop, 2006), but we 
will focus on two biological applications for which this problem is particularly relevant. 
In neuroscience, SRM experiments are commonly used to characterize the response of 
neurons to stimuli (Schwartz et al., 2006). For instance, S may be an image to which 
a retina is exposed, while M is a binary variable ('spike' or 'no spike') indicating the 
response of a single retinal ganglion cell. It is often assumed that the spiking probability 
depends on a linear projection R of S. The specific probability of a spike given R is 
determined by the noise function $\pi$.

More recently, analogous experiments have been used to characterize the biophys- 
ical mechanisms of transcriptional regulation. In the context of work by Kinney et al. 
(2010), S is the DNA sequence of a transcriptional regulatory region, R is the rate of 
mRNA transcription produced by this sequence, and M is a (noisy) measurement of the 
resulting level of gene expression. The filter $\theta$ is a function of DNA sequence that
reflects the underlying molecular mechanisms of transcript initiation. The noise function
$\pi$ accounts for both biological noise^1 and instrument noise.

^1 Such as stochastic gene expression (Elowitz et al., 2002).

The standard approach for solving inference problems like these is to adopt a specific
noise function $\pi$, then search a space $\Theta$ of possible filters for the one filter $\theta$ that
maximizes the likelihood $p(\{M_n\}|\{S_n\}, \theta, \pi) = e^{N L(\theta, \pi)}$ where,

$$ L(\theta, \pi) = \frac{1}{N} \sum_{n=1}^{N} \log \pi(M_n | \theta(S_n)) \qquad (2) $$

is the per-datum log likelihood. For instance, the method of least squares regression
corresponds to maximum likelihood inference assuming a homogeneous Gaussian noise
function $\pi$ (Bishop, 2006).

Although the correct filter $\theta$ does indeed maximize $L(\theta, \pi)$ when the correct noise
function $\pi$ is used, full a priori knowledge of this noise function is rare in practice.
Often $\pi$ is chosen primarily for computational convenience, as is standard with
least-squares regression. This can be problematic because using an incorrect $\pi$ will typically
produce bias in the inferred filter $\theta$, bias that does not disappear in the $N \to \infty$ limit.
The reason for this is illustrated in Fig. 1.

Sometimes this problem can be partially alleviated by performing a separate "calibration
experiment" in which the noise function $\pi(M|R)$ is measured directly. For
instance, one might be able to make repeated measurements M for a select number of
known representations R. However, there will always be residual measurement error
in $\pi$ that will propagate to $\theta$ in a manner that is not properly accounted for by simply
plugging $\pi$ into likelihood calculations via Eq. 2.

An alternative inference procedure (Sharpee et al., 2004; Paninski, 2003; Kinney
et al., 2007) that circumvents the need for an assumed noise function is to maximize the
mutual information (Cover & Thomas, 1991),

$$ I(\theta) = I[R; M] = \int dR\, dM\, p(R, M) \log \frac{p(R, M)}{p(R)\, p(M)} \qquad (3) $$

between predictions R and measurements M.^2 Here, p(R, M) is the empirical joint
distribution between predictions and measurements, and thus depends implicitly on $\theta$.
This method has been proposed, studied, and applied in the specific contexts of recep- 
tive field inference (Sharpee et al., 2004; Paninski, 2003; Sharpee et al., 2006; Pillow & 
Simoncelli, 2006) and transcriptional regulation (Kinney et al., 2007; Elemento et al., 
2007; Kinney, 2008; Kinney et al., 2010; Melnikov et al., 2012). However, this alterna- 
tive approach can be applied to a much wider range of statistical regression problems, 
and a general discussion of how maximizing mutual information relates to maximizing 
likelihood for arbitrary SRM systems has yet to be presented. 

We begin by pointing out that, in the $N \to \infty$ limit, maximizing mutual information
over $\theta$ alone is equivalent to maximizing likelihood over both $\theta$ and $\pi$. We then
prove that when the correct filter $\theta$ lies within the class of filters being considered,
maximizing mutual information is also equivalent to simultaneously maximizing every
dependence measure that satisfies the Data Processing Inequality (DPI). However,
in the absence of a known noise function $\pi$, SRM experiments are fundamentally incapable
of constraining certain directions in the parameter space of $\theta$; we call these
directions "diffeomorphic modes." An equation for diffeomorphic modes is described
and then applied to filters having various functional forms. In particular, our analysis
of a linear-nonlinear filter used by Kinney et al. (2010) to model transcriptional regulation
demonstrates how model nonlinearities can eliminate diffeomorphic modes in
useful and non-obvious ways. This has important consequences for biophysical studies
of transcriptional regulation that use recently developed DNA-sequencing-based assays
(Kinney et al., 2010; Melnikov et al., 2012).

^2 The notation $I(\theta)$ and $I[R; M]$ will be used interchangeably.

Throughout this manuscript, R is used to implicitly denote the representation predicted
by the filter $\theta$ for signal S, i.e. $R = \theta(S)$. $\mathcal{D}(\theta) = \mathcal{D}[R; M]$ is used to denote any
DPI-satisfying dependence measure. Representations R are assumed to be multidimensional
with components $R^\mu$, and $\partial_\mu = \partial/\partial R^\mu$. $\theta$ is used to denote both a filter and the
parameters governing that filter. $\Theta$ represents both an abstract space of filters, as well
as the space of parameters for filters assumed to have a specific functional form. In the
latter case, $\theta^i$ denotes coordinates in parameter space, and $\partial_i = \partial/\partial\theta^i$.



2 Mutual information and likelihood 

We begin by discussing the connection between likelihood and mutual information in
the $N \to \infty$ limit. In this limit, the per-datum log likelihood (Eq. 2) can be rewritten
as,

$$ \begin{aligned} L(\theta, \pi) &= \int dR\, dM\, p(R, M) \log \pi(M|R) \qquad &(4) \\ &= I(\theta) - D(\theta, \pi) - H[M]. \qquad &(5) \end{aligned} $$

The first term, $I(\theta)$, is the mutual information between R and M (Eq. 3) and is independent
of the noise function $\pi$. The second term,

$$ D(\theta, \pi) = \int dR\, dM\, p(R, M) \log \frac{p(M|R)}{\pi(M|R)}, \qquad (6) $$

is the Kullback-Leibler (KL) divergence between the empirical distribution $p(M|R)$,
which results from the choice of $\theta$, and the assumed noise function $\pi(M|R)$. The last
term, $H[M] = -\int dM\, p(M) \log p(M)$, is the entropy of the measurements M. $H[M]$
is independent of both $\theta$ and $\pi$ and can thus be ignored in the optimization problem.

The key point is that finding maximally informative filters $\theta$ is equivalent to solving
the maximum likelihood problem over both filters $\theta$ and noise functions $\pi$. This is
because if $\theta$ maximizes $I(\theta)$, simply choosing a noise function that matches the empirical
noise function, i.e. setting $\pi(M|R) = p(M|R)$, will reduce $D(\theta, \pi)$ to its minimum
value of zero and thus maximize $L(\theta, \pi)$.
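
The decomposition in Eqs. 4-6 can be checked numerically on a small discrete example. The sketch below is illustrative only: the joint distribution `P_JOINT`, the mismatched noise function `WRONG_NOISE`, and all function names are hypothetical and are not taken from the original analysis. Integrals are replaced by finite sums. Under these assumptions, the per-datum log likelihood should equal I - D - H[M] for any assumed noise function, and choosing the noise function equal to the empirical conditional p(M|R) should give D = 0.

```python
import math

# Hypothetical empirical joint distribution p(R, M): rows index three
# representation values, columns index two measurement values.
P_JOINT = [[0.30, 0.05],
           [0.15, 0.15],
           [0.05, 0.30]]

# Hypothetical (mismatched) assumed noise function pi(M|R); each row sums to 1.
WRONG_NOISE = [[0.6, 0.4],
               [0.5, 0.5],
               [0.4, 0.6]]


def marginals(joint):
    p_r = [sum(row) for row in joint]
    p_m = [sum(row[j] for row in joint) for j in range(len(joint[0]))]
    return p_r, p_m


def empirical_noise(joint):
    """Conditional distribution p(M|R) implied by the joint distribution."""
    p_r, _ = marginals(joint)
    return [[p / p_r[i] for p in row] for i, row in enumerate(joint)]


def mutual_information(joint):
    p_r, p_m = marginals(joint)
    return sum(p * math.log(p / (p_r[i] * p_m[j]))
               for i, row in enumerate(joint) for j, p in enumerate(row) if p > 0)


def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)


def log_likelihood(joint, noise):
    """Eq. 4 with sums in place of integrals."""
    return sum(p * math.log(noise[i][j])
               for i, row in enumerate(joint) for j, p in enumerate(row) if p > 0)


def kl_divergence(joint, noise):
    """Eq. 6: divergence between p(M|R) and the assumed noise function."""
    cond = empirical_noise(joint)
    return sum(p * math.log(cond[i][j] / noise[i][j])
               for i, row in enumerate(joint) for j, p in enumerate(row) if p > 0)


def decomposition(joint, noise):
    """Return (L, I - D - H[M]); the two numbers should agree (Eq. 5)."""
    info = mutual_information(joint)
    h_m = entropy(marginals(joint)[1])
    return log_likelihood(joint, noise), info - kl_divergence(joint, noise) - h_m


if __name__ == "__main__":
    for name, noise in [("mismatched", WRONG_NOISE),
                        ("matched", empirical_noise(P_JOINT))]:
        lhs, rhs = decomposition(P_JOINT, noise)
        print(name, lhs, rhs, kl_divergence(P_JOINT, noise))
```

Because I and H[M] do not depend on the assumed noise function, any gain in likelihood from changing the noise function at a fixed filter comes entirely from reducing D.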

If one can formalize prior assumptions about the noise function $\pi$ using a Bayesian
prior $p(\pi)$, the relevant objective function becomes the per-datum marginal likelihood,

$$ L_m(\theta) = \frac{1}{N} \log \int d\pi\, p(\pi)\, p(\{M_n\}|\{S_n\}, \theta, \pi). \qquad (7) $$

This is analogous to Eq. 4 computed after all possible noise functions have been integrated
out. As has been shown in previous work (Kinney et al., 2007; Rajan et al.,
2013), maximizing marginal likelihood and maximizing mutual information are essentially
equivalent in the $N \to \infty$ limit. This can be seen by decomposing $L_m(\theta)$ as,

$$ L_m(\theta) = I(\theta) - \Delta(\theta) - H[M], \qquad (8) $$

where,

$$ \Delta(\theta) = -\frac{1}{N} \log \int d\pi\, p(\pi)\, e^{-N D(\theta, \pi)}. \qquad (9) $$

Under weak assumptions about the prior $p(\pi)$,^3 $\Delta \to 0$ as $N \to \infty$ (see Appendix A).

^3 E.g. $p(\pi)$ does not vanish at the true noise function $\pi^*$.

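
The vanishing of the correction term in Eq. 9 can be illustrated with a deliberately simplified prior. The sketch below assumes, purely for illustration, that the prior is supported on three candidate noise functions, so that the integral in Eq. 9 becomes a finite sum; the weights in `PRIOR` and the divergences in `DIVERGENCES` are hypothetical numbers. The first candidate is taken to match the empirical noise function (D = 0) and carries a small but nonzero prior weight, which is the condition described in the footnote above.

```python
import math


def delta(n_data, prior, divergences):
    """Finite-sum analogue of Eq. 9: -(1/N) log sum_k p(pi_k) exp(-N D_k)."""
    terms = [math.log(p) - n_data * d for p, d in zip(prior, divergences) if p > 0]
    top = max(terms)  # log-sum-exp shift to avoid underflow at large N
    log_sum = top + math.log(sum(math.exp(t - top) for t in terms))
    return -log_sum / n_data


# Hypothetical prior over three candidate noise functions and their
# divergences D(theta, pi_k); the first candidate equals p(M|R), so D = 0.
PRIOR = [0.01, 0.49, 0.50]
DIVERGENCES = [0.0, 0.2, 0.7]

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, delta(n, PRIOR, DIVERGENCES))
```

In this toy setting the correction is bounded above by -(1/N) log of the prior weight on the matching candidate, so it decays at least as fast as 1/N.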
3 DPI-optimal filters

Mutual information is just one measure among many that satisfy DPI (see Appendix 
B). In this section, we discuss the importance of DPI for the SRM inference problem 
and introduce the notion of "DPI-optimal" filters. 

Paninski (2003) has argued as follows for using DPI-satisfying dependence measures
as objective functions for inferring filters. If $\theta^*$ is the correct filter in an SRM
experiment, then for every filter $\theta$,

$$ R \xleftarrow{\;\theta\;} S \xrightarrow{\;\theta^*\;} R^* \xrightarrow{\;\pi^*\;} M, \qquad (10) $$

is a Markov chain. This implies $\mathcal{D}(\theta) \leq \mathcal{D}(\theta^*)$ for every DPI-satisfying measure $\mathcal{D}$. If
$\theta^*$ resides within the space $\Theta$ of filters being explored, it must therefore fall within the
subset $\Theta_\mathcal{D} \subseteq \Theta$ on which $\mathcal{D}$ is maximized. As a simple extension of this argument,
we point out that, because $\theta^*$ maximizes all DPI-satisfying measures, $\theta^*$ must actually
lie within the intersection of all such sets, i.e.,

$$ \theta^* \in \Theta_{\rm DPI} \equiv \bigcap_{\mathcal{D} \text{ satisfying DPI}} \Theta_\mathcal{D}. \qquad (11) $$

Filters in $\Theta_{\rm DPI}$ can properly be said to be "DPI-optimal."

This raises an important question: would optimizing a variety of different measures
$\mathcal{D}$, not just mutual information, narrow the search for $\theta^*$? Here we show the answer
is 'no'; when $\theta^* \in \Theta$, maximizing mutual information is equivalent to simultaneously
maximizing every DPI-satisfying measure, i.e.,

$$ \Theta_I = \Theta_{\rm DPI}. \qquad (12) $$

To prove this, we first define on the space of all possible filters a weak and strong
partial ordering, as well as an equivalence relation. These mathematical structures are
a natural consequence of DPI. For any two filters $\theta_1$ and $\theta_2$,^4 we write,

$$ \text{weak ordering:} \quad \theta_1 \leq \theta_2 \iff \mathcal{D}(\theta_1) \leq \mathcal{D}(\theta_2) \text{ for all } \mathcal{D}, \qquad (13) $$

$$ \text{strong ordering:} \quad \theta_1 < \theta_2 \iff \theta_1 \leq \theta_2 \text{ but not } \theta_2 \leq \theta_1, \qquad (14) $$

$$ \text{equivalence:} \quad \theta_1 \sim \theta_2 \iff \theta_1 \leq \theta_2 \text{ and } \theta_2 \leq \theta_1. \qquad (15) $$

Note that $\theta_1 \leq \theta_2$ if $R_1 \leftrightarrow R_2 \leftrightarrow M$ is a Markov chain. The set $\Theta_{\rm DPI}$ of DPI-optimal
filters is the supremum of $\Theta$ under this partial ordering. The equivalence $\Theta_I = \Theta_{\rm DPI}$,
which occurs when $\theta^* \in \Theta$, follows directly from the fact, proven in Appendix C, that
$\theta < \theta^*$ implies $I(\theta) < I(\theta^*)$. We note that this is not true for all DPI-satisfying measures.
For instance, the trivial measure $\mathcal{D} = 0$ satisfies DPI but reveals no information
about whether a given $\theta$ resides in $\Theta_{\rm DPI}$. These results are illustrated in Fig. 2.
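
The ordering induced by DPI can be illustrated for mutual information in the special case where a candidate representation R is a deterministic function of the correct representation R*, so that R, R*, M form a Markov chain. The sketch below is a toy example under that assumption: the joint distribution `P_STAR` and the helper names are hypothetical. An invertible relabeling of R* leaves the information unchanged, whereas a lossy map that merges values with different measurement statistics strictly lowers it.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """joint: dict mapping (r, m) -> probability."""
    p_r, p_m = defaultdict(float), defaultdict(float)
    for (r, m), p in joint.items():
        p_r[r] += p
        p_m[m] += p
    return sum(p * math.log(p / (p_r[r] * p_m[m]))
               for (r, m), p in joint.items() if p > 0)


def push_forward(joint, f):
    """Joint distribution of (f(R*), M) given the joint of (R*, M)."""
    out = defaultdict(float)
    for (r, m), p in joint.items():
        out[(f(r), m)] += p
    return dict(out)


# Hypothetical joint p(R*, M): four representation values, binary measurement.
P_STAR = {(0, 0): 0.20, (0, 1): 0.05,
          (1, 0): 0.15, (1, 1): 0.10,
          (2, 0): 0.10, (2, 1): 0.15,
          (3, 0): 0.05, (3, 1): 0.20}

if __name__ == "__main__":
    print("I[R*; M]          ", mutual_information(P_STAR))
    print("invertible relabel", mutual_information(push_forward(P_STAR, lambda r: 3 - r)))
    print("lossy merge       ", mutual_information(push_forward(P_STAR, lambda r: r // 2)))
```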

4 Diffeomorphic modes 

Whether or not two filters $\theta_1$ and $\theta_2$ satisfy the above equivalence relation (Eq. 15) can
depend on the true filter $\theta^*$ and on the specific noise function $\pi^*$ of the SRM experiment.
However, certain pairs of filters will satisfy $\theta_1 \sim \theta_2$ under all SRM experiments.
We will refer to such pairs of filters as being "information equivalent." In Appendix
D we prove that two filters are information equivalent if and only if their predicted
representations are related by an invertible transformation.

As an objective function, mutual information is inherently incapable of distinguishing
between information equivalent filters. In practice this means that selecting maximally
informative filters from a parametrized set of filters can leave some directions in
parameter space unconstrained. Here we term these directions "diffeomorphic modes."

^4 The subscripts 1 and 2 label two different filters, not two parameters of a single filter.

The diffeomorphic modes of linear filters have an important and well-recognized 
consequence in neuroscience: the technique of maximally informative dimensions can 
identify only the relevant subspace of signal space, not a specific basis within that sub- 
space (Sharpee et al., 2004; Paninski, 2003; Pillow & Simoncelli, 2006). However, an 
interesting twist occurs in applications to transcriptional regulation. Here, linear filters 
are often used to model the sequence-dependent binding energies of proteins to DNA 
(Stormo, 2013). Any mechanistic hypothesis about how DNA-bound proteins interact 
with one another predicts that the transcription rate will depend on these binding ener- 
gies in a specific nonlinear manner (Bintu et al., 2005; Stormo, 2013). Such up-front 
knowledge about the nonlinearities of linear-nonlinear filters can eliminate diffeomor- 
phic modes of the underlying linear filters in useful and non-obvious ways (Kinney, 
2008; Kinney et al., 2010). 

4.1 An equation for diffeomorphic modes 

Consider a filter $\theta$, representing a point in $\Theta$, whose parameters $\theta^i$ are infinitesimally
transported along a vector field having components $g^i(\theta)$. This yields a new filter $\theta'$
with components $\theta'^i = \theta^i + \epsilon g^i(\theta)$. If the representation R predicted by $\theta$ for a specified
signal S has components $R^\mu$ in representation space, these will be transformed to
$R'^\mu = R^\mu + \epsilon \sum_i g^i(\theta) \partial_i R^\mu$.

If the vector field $g^i(\theta)$ represents a diffeomorphic mode of $\Theta$, this transformation
must be invertible, meaning the values $\sum_i g^i(\theta) \partial_i R^\mu$ cannot depend on S except
through the value of R. This is a nontrivial condition because $\partial_i R$ can depend on the
underlying signal S in an arbitrary manner. However, if $\sum_i g^i(\theta) \partial_i R^\mu$ does indeed
depend only on the value of R then,

$$ \sum_i g^i(\theta)\, \partial_i R^\mu = h^\mu(R, \theta) \qquad (16) $$

for some vector function $h^\mu(R, \theta)$. This is the equation that any diffeomorphic mode
$g^i(\theta)$ must satisfy.

4.2 General linear filters 

We now use Eq. 16 to derive the diffeomorphic modes of general linear filters. By
definition, a linear filter $\theta$ yields a representation R that is a linear combination of
signal "features" $F_i^\mu$, i.e.,

$$ R^\mu = \sum_i \theta^i F_i^\mu(S). \qquad (17) $$

As is standard with regression problems (Bishop, 2006), the term "linear" describes
how R depends on the parameters $\theta^i$; the features $F_i^\mu$ need not be linear functions of S.

To find the diffeomorphic modes of these filters, we apply the operator $\sum_i g^i(\theta) \partial_i$ to
both sides of Eq. 17. Using Eq. 16 we then find $\sum_i g^i(\theta) F_i^\mu(S) = h^\mu(R, \theta)$, with R given by Eq. 17.
The left-hand side is linear in signal features, so unless something unusual happens,^5
$h^\mu(R, \theta)$ must also be a linear function of R, i.e. have the form,

$$ h^\mu(R, \theta) = a^\mu(\theta) + \sum_\nu b^\mu_\nu(\theta) R^\nu. \qquad (18) $$

^5 E.g. if the various features $F_i^\mu(S)$ exhibit complicated interdependencies, either because of their
functional form or because signals S are restricted to a particular subspace. We ignore such possibilities
here.






Downloaded from http://biorxiv.org/on September 18, 2014 



The number of diffeomorphic modes is bounded above by the number of independent
parameters on which $h^\mu$ depends (at each $\theta$).^6 For a general linear filter we see
that there can be no more than dim(R)[dim(R) + 1] diffeomorphic modes, which is
the number of parameters $a^\mu$ and $b^\mu_\nu$ in Eq. 18. This bound is independent of the number
of signal features, i.e. the dimensionality of S. In particular, if R is a scalar, then
$h = a + bR$. In this case we observe two diffeomorphic modes, corresponding to
additive and multiplicative transformations of R.
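
For a scalar representation, the two modes can be seen directly in simulated data. The sketch below is a toy illustration, not the estimator used in the cited studies: the simulated SRM experiment in `simulate` (two Gaussian features, logistic spiking noise), the filter coefficients, and the rank-based binned estimator `binned_information` are all assumptions made for this example. Because the estimator depends on R only through the rank order of its values, shifting and rescaling R (with positive scale) leaves the estimate unchanged, whereas changing the direction of the filter in feature space does not.

```python
import math
import random


def binned_information(r_values, m_values, n_bins=4):
    """Plug-in estimate of I[R; M] using equal-occupancy (rank-based) bins in R."""
    n = len(r_values)
    order = sorted(range(n), key=lambda k: r_values[k])
    counts, c_bin, c_m = {}, {}, {}
    for rank, k in enumerate(order):
        b, m = rank * n_bins // n, m_values[k]
        counts[(b, m)] = counts.get((b, m), 0) + 1
        c_bin[b] = c_bin.get(b, 0) + 1
        c_m[m] = c_m.get(m, 0) + 1
    return sum(c / n * math.log(c * n / (c_bin[b] * c_m[m]))
               for (b, m), c in counts.items())


def simulate(n=2000, seed=0):
    """Hypothetical SRM data: two Gaussian features, logistic spiking noise."""
    rng = random.Random(seed)
    features = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    spikes = []
    for f1, f2 in features:
        r_true = 1.5 * f1 + 0.5 * f2
        p_spike = 1.0 / (1.0 + math.exp(-2.0 * r_true))
        spikes.append(1 if rng.random() < p_spike else 0)
    return features, spikes


def predict(theta, features):
    """Scalar linear filter R = sum_i theta_i F_i(S)."""
    return [theta[0] * f1 + theta[1] * f2 for f1, f2 in features]


if __name__ == "__main__":
    features, spikes = simulate()
    r_pred = predict((1.5, 0.5), features)
    print("correct direction   ", binned_information(r_pred, spikes))
    # Additive and multiplicative modes: R -> a + b R with a = 1, b = 2.
    print("shifted and rescaled", binned_information([1.0 + 2.0 * r for r in r_pred], spikes))
    # A different direction in feature space is not a diffeomorphic mode.
    print("orthogonal direction", binned_information(predict((0.5, -1.5), features), spikes))
```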

4.3 A linear-nonlinear filter 

Kinney et al. (2010) performed experiments probing the biophysical mechanism of transcriptional
regulation at the Escherichia coli lac promoter (Fig. 3A). These experiments
are of the SRM form where S is the DNA sequence of a mutated lac promoter, M is
a measurement of the resulting gene expression, and the mRNA transcription rate T is
the internal representation of the system. Linear filters were used to model the binding
energies Q and P of the two proteins CRP and RNAP. The specific parametric form
used for these filters was,

$$ Q = \sum_{bl} \theta_Q^{bl} S_{bl} + \theta_Q^0, \qquad P = \sum_{bl} \theta_P^{bl} S_{bl} + \theta_P^0, \qquad (19) $$

where b indexes the four possible bases (A,C,G,T), l indexes nucleotide positions within
the 75 bp promoter DNA region, $S_{bl} = 1$ if base b occurs at position l and $S_{bl} = 0$
otherwise.

^6 Technically the number of diffeomorphic modes is the number of independent vector fields $g^i$ that
correspond to such transformations. However, here we consider only proper diffeomorphic modes, not
gauge transformations; as in physics, we define gauge transformations to be vector fields $g^i$ along which
transformation of $\theta$ leaves all predicted representations invariant.

Measurements M were taken for $\sim 5 \times 10^4$ mutant lac promoters S. These data
were then used to fit a model for the sequence-dependent binding energy of CRP. This
was done by maximizing $I[Q; M]$. Because of the diffeomorphic modes of Q, the
parameters $\theta_Q^{bl}$ were inferred up to an unknown scale and the additive constant $\theta_Q^0$ was
left undetermined. This is shown in Fig. 3B. Analogous results were obtained for RNAP
(Fig. 3C).

Next, a full thermodynamic model of transcriptional regulation was proposed and
fit to the data. Based on the hypothesized biophysical mechanism, the transcription rate
T was assumed to depend on S via,

$$ T = T_{\max} \frac{R}{1 + R} \quad \text{where} \quad R = e^{-P}\, \frac{1 + e^{-Q-\gamma}}{1 + e^{-Q}}. \qquad (20) $$

This quantity R is called the "regulation factor" of the promoter (Bintu et al., 2005).
Because R is an invertible function of T, it serves equally well as the representation
of the SRM system. In the following analysis we work with R instead of T due to its
simpler functional form.

When the parameters of the linear filters P and Q were simultaneously fit to data
by maximizing $I[T; M]$ (or equivalently, maximizing $I[R; M]$), three of the four diffeomorphic
modes described above were eliminated (Fig. 3D). Specifically, the overall
scales of the parameters $\theta_Q^{bl}$ and $\theta_P^{bl}$ were fixed, allowing binding energy predictions for
CRP and RNAP in physically meaningful units of $k_B T$. The parameter $\theta_Q^0$, corresponding
to the intracellular concentration of CRP, was also fixed by the data. The only
diffeomorphic mode left unbroken was $\theta_P^0$, corresponding to the intracellular concentration
of RNAP.

^7 To fix the gauge freedoms of these filters, Kinney et al. (2010) adopted the convention that
$\min_b \theta_Q^{bl} = \min_b \theta_P^{bl} = 0$ for all positions l.

We now show how the nonlinearity in R was able to break three of the four diffeomorphic
modes of P and Q. First observe that any diffeomorphic mode of a linear-nonlinear
filter must also be a diffeomorphic mode of each individual linear filter if, as
here, the linear filters are independent functions of S. This means any diffeomorphic
mode $g^i$ of the full thermodynamic model for R must satisfy,

$$ \sum_i g^i \partial_i R = h = (a_P + b_P P)\, \partial_P R + (a_Q + b_Q Q)\, \partial_Q R + a_\gamma\, \partial_\gamma R, \qquad (21) $$

for coefficients $a_P$, $b_P$, $a_Q$, $b_Q$, $a_\gamma$ which do not depend on S. Evaluating the right-hand
side derivatives and substituting for P in terms of Q and R we find,

$$ h = -R \left[ a_P - b_P \log\!\left( \frac{R\,(1 + e^{-Q})}{1 + e^{-Q-\gamma}} \right) - \frac{(a_Q + b_Q Q)\, e^{-Q} (1 - e^{-\gamma})}{(1 + e^{-Q-\gamma})(1 + e^{-Q})} + \frac{a_\gamma\, e^{-Q-\gamma}}{1 + e^{-Q-\gamma}} \right]. \qquad (22) $$

For $g^i$ to be a diffeomorphic mode, the right-hand side must be independent of S for
fixed R. The terms dependent on Q must therefore vanish, rendering $b_P = a_Q = b_Q = a_\gamma = 0$.^8
Any diffeomorphic modes $g^i$ must therefore satisfy $\sum_i g^i \partial_i R = -a_P R$. Thus
only one mode remains, corresponding to an additive shift in the binding energy P.
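
The step from Eq. 21 to Eq. 22 can be made explicit. The following is a sketch that assumes the regulation factor has the form $R = e^{-P}(1 + e^{-Q-\gamma})/(1 + e^{-Q})$, which is the form consistent with the terms appearing in Eq. 22; if Eq. 20 is read differently, the derivatives below change accordingly.

```latex
\begin{align*}
\partial_P R &= -R, \\
\partial_Q R &= e^{-P}\,\frac{e^{-Q}\,(1 - e^{-\gamma})}{(1 + e^{-Q})^2}
             = R\,\frac{e^{-Q}\,(1 - e^{-\gamma})}{(1 + e^{-Q-\gamma})(1 + e^{-Q})}, \\
\partial_\gamma R &= -e^{-P}\,\frac{e^{-Q-\gamma}}{1 + e^{-Q}}
             = -R\,\frac{e^{-Q-\gamma}}{1 + e^{-Q-\gamma}}, \\
P &= -\log\!\left( \frac{R\,(1 + e^{-Q})}{1 + e^{-Q-\gamma}} \right).
\end{align*}
```

Substituting these four expressions into Eq. 21 and factoring out $-R$ gives Eq. 22. At fixed R the value of Q can still vary with S, so each Q-dependent term must vanish separately. If $\gamma = 0$, the factor $(1 - e^{-\gamma})$ vanishes and R no longer depends on Q at all, which is why the argument requires $\gamma \neq 0$.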



5 Discussion 

Likelihood-based inference masks the fundamentally different ways in which data constrain
the parameters that lie along diffeomorphic modes versus those that lie along
nondiffeomorphic modes. Standard likelihood inference constrains all model parameters,^9
including both diffeomorphic and nondiffeomorphic modes, with error bars that
scale as $N^{-1/2}$. These constraints will be consistent with the correct underlying filter
$\theta^*$ when the correct noise function is used (Fig. 4A). However, use of an incorrect
noise function will typically cause $\theta^*$ to fall outside the error bars inferred along both
diffeomorphic and nondiffeomorphic modes (Fig. 4B).

^8 This assumes $\gamma \neq 0$, i.e. that CRP actually interacts with RNAP, which is true.

This problem is rectified if we use a prior $p(\pi)$ that reflects our uncertainty about
what the true noise function is. From Eq. 8 it can be seen that using the resulting
marginal likelihood to compute a posterior distribution on $\theta$ will constrain diffeomorphic
and nondiffeomorphic modes in fundamentally different ways (Fig. 4C). Nondiffeomorphic
modes will be constrained by $I(\theta)$, which remains finite in the large N
limit. This produces error bars on nondiffeomorphic modes comparable to those produced
by likelihood when the correct noise function $\pi^*$ is used. However, constraints
along diffeomorphic modes will come only from $\Delta$. Because $\Delta$ vanishes as $N^{-1}$,^10
diffeomorphic constraints become independent of N once N is sufficiently large.

Fortunately, one does not need to posit a specific prior probability over all possible
noise functions in order to confidently infer filters from SRM data. Using mutual information
as an objective function instead of likelihood, i.e. sampling filters according to
$p(\theta|\text{data}) \sim e^{N I(\theta)}$, will constrain nondiffeomorphic modes the same way that marginal
likelihood does while putting no constraints along diffeomorphic modes (Fig. 4D).

One might worry that a large fraction of filter parameters will be diffeomorphic, and
that the analysis of SRM experiments will require an assumed noise function in order
to obtain useful results even if doing so yields unreliable error bars. Such situations are
conceivable, but in practice this is often not the case. We have shown that for linear
filters, the number of diffeomorphic modes will typically not exceed dim(R)[dim(R) + 1]
regardless of how large dim(S) is. Some of these diffeomorphic modes may also be
eliminated if these linear filters are combined using a nonlinearity of known functional
form. Indeed, of the 204 independent parameters comprising the biophysical model of
transcriptional regulation inferred by Kinney et al. (2010), only one was diffeomorphic.

^9 In this discussion we ignore gauge parameters, which do not alter model predictions and are therefore
non-identifiable.

^10 More precisely, given any direction i in filter space, $\partial_i \Delta|_{\theta^*} \sim N^{-1}$ for N large enough.

A bigger concern, perhaps, is the practical difficulty of using mutual information as 
an objective function. Specifically, it remains unclear how to compute $I(\theta)$ rapidly and
reliably enough to confidently sample from $p(\theta|\text{data}) \sim e^{N I(\theta)}$. Still, various methods
for estimating mutual information are available (Khan et al., 2007; Panzeri et al.,
2007), and the information optimization problem has been successfully implemented 
using a variety of techniques (Sharpee et al., 2004; Sharpee et al., 2006; Kinney et al., 
2007, 2010; Melnikov et al., 2012). We believe the exciting applications of mutual- 
information-based inference provide compelling motivation for making progress on 
these practical issues. 

6 Appendix A: marginal likelihood 

In certain cases $\Delta(\theta)$ can be computed explicitly and thereby be shown to vanish
(Kinney et al., 2007). More generally, when $\pi$ is taken to be finite-dimensional, a
saddle-point computation (valid for large N) gives $\Delta(\theta) \approx \frac{1}{2N} \mathrm{Tr}[\log \partial\partial\tilde{D}] + \text{const}$.
Here, $\partial\partial\tilde{D}$ is the $\pi$-space Hessian of $\tilde{D}(\theta, \pi) = D(\theta, \pi) - \frac{1}{N} \log p(\pi)$ computed using
$\pi(M|R) = p(M|R)$. If $\log p(\pi)$ and its derivatives are bounded, then the $\theta$-dependent
part of $\Delta(\theta)$ decays as $N^{-1}$. If $\pi$ is infinite dimensional, this saddle-point computation
becomes a semiclassical computation in field theory akin to the density estimation problem
studied by Bialek et al. (1996). If this field theory is properly formulated through
an appropriate choice of $p(\pi)$, then $\Delta(\theta)$ may exhibit different decay behavior, but will
still vanish as $N \to \infty$. See also Rajan et al. (2013).

7 Appendix B: DPI-satisfying measures 

DPI is satisfied by all measures of the F-information form (Csiszár & Shields, 2004;
Kinney & Atwal, 2013),

$$ \mathcal{D}_F[R; M] = \int dR\, dM\, p(R)\, p(M)\, F\!\left( \frac{p(R, M)}{p(R)\, p(M)} \right), \qquad (23) $$

where F(x) is a convex function for x > 0. Mutual information corresponds to $F(x) = x \log x$
whereas $F(x) = (x^\alpha - 1)/(\alpha - 1)$ yields a more general "Rényi information"
measure (Rényi, 1961) that reduces to mutual information when $\alpha \to 1$. DPI-satisfying
measures other than mutual information have been used for filter inference in a number
of works, including Paninski (2003) and Kouh & Sharpee (2009). A discussion of the
differences between DPI-satisfying measures and some non-DPI-satisfying measures
can be found in (Kinney & Atwal, 2013).
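
The F-information family can be explored numerically on a discrete example. The sketch below assumes the form $\sum_{R,M} p(R)\,p(M)\,F(p(R,M)/[p(R)\,p(M)])$ for a discrete joint distribution; the table `P_JOINT` and the helper names are hypothetical. It evaluates the measure for $F(x) = x \log x$ and for the Rényi-type choice of F, checks that the latter approaches mutual information as $\alpha$ approaches 1, and shows that coarse-graining R does not increase any of these measures.

```python
import math


def f_information(joint, f):
    """Discrete F-information: sum_{R,M} p(R) p(M) F(p(R,M) / [p(R) p(M)])."""
    p_r = [sum(row) for row in joint]
    p_m = [sum(row[j] for row in joint) for j in range(len(joint[0]))]
    return sum(p_r[i] * p_m[j] * f(p / (p_r[i] * p_m[j]))
               for i, row in enumerate(joint) for j, p in enumerate(row))


def f_mutual(x):
    """F(x) = x log x, with the convention F(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def f_renyi(alpha):
    """F(x) = (x^alpha - 1) / (alpha - 1), for alpha != 1."""
    return lambda x: (x ** alpha - 1.0) / (alpha - 1.0)


def merge_rows(joint, i, j):
    """Coarse-grain R by merging representation values i and j (a lossy map)."""
    merged = [a + b for a, b in zip(joint[i], joint[j])]
    return [row for k, row in enumerate(joint) if k not in (i, j)] + [merged]


# Hypothetical joint distribution p(R, M): three R values, two M values.
P_JOINT = [[0.30, 0.05],
           [0.15, 0.15],
           [0.05, 0.30]]

if __name__ == "__main__":
    print("mutual information", f_information(P_JOINT, f_mutual))
    for alpha in (2.0, 1.1, 1.001):
        print("alpha =", alpha, f_information(P_JOINT, f_renyi(alpha)))
    coarse = merge_rows(P_JOINT, 0, 1)
    print("after merging, alpha = 2:", f_information(coarse, f_renyi(2.0)))
```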

8 Appendix C: DPI-optimality 

Assume $\theta < \theta^*$ by Eq. 14. Because $R \leftrightarrow R^* \leftrightarrow M$ is a Markov chain, the KL divergence
between $p(R^*|R, M)$ and $p(R^*|R)$ can be decomposed as
$D(p(R^*|R, M) \| p(R^*|R)) = I[R^*; M] - I[R; M]$. If this quantity is zero, then
$R^* \leftrightarrow R \leftrightarrow M$ is also a Markov chain, implying $\theta^* \leq \theta$, a contradiction.
This KL divergence must therefore be positive, i.e.
$I(\theta) < I(\theta^*)$. So if $\theta^* \in \Theta_{\rm DPI}$, then for every $\theta \notin \Theta_{\rm DPI}$, $\theta \notin \Theta_I$ as well. This proves
$\Theta_I = \Theta_{\rm DPI}$.

9 Appendix D: information equivalence 

First we observe that if $\theta_1$ and $\theta_2$ make isomorphic predictions then they are information
equivalent. This is readily shown from the fact that $\mathcal{D}[R; M]$ is invariant under arbitrary
invertible transformations of R (Kinney & Atwal, 2013). Next we show the converse:
if $\theta_1$ and $\theta_2$ are information equivalent, the predictions $R_1$ and $R_2$ must be isomorphic.
Here is the proof. If $\theta_1 \sim \theta_2$, then $\mathcal{D}[R_1; M] = \mathcal{D}[R_2; M]$ for all $\mathcal{D}$, and in particular
$I[R_1; M] = I[R_2; M]$. In Appendix C we showed that $I[R; M] = I[R^*; M]$ implies
$R^* \leftrightarrow R \leftrightarrow M$ is a Markov chain. Imagining an SRM experiment in which $\theta^* = \theta_1$
and $\pi(M|R) = \delta(M - R)$, we find that $R_1 \leftrightarrow R_2 \leftrightarrow R_1$ is a Markov chain. This
implies the mapping $R_2 \to R_1$ is one-to-one. Similarly, $R_1 \to R_2$ is one-to-one. $R_1$
and $R_2$ are therefore related by a bijection.
Figure 1: Maximizing likelihood with an incorrect noise function will generally
bias the inferred filter. The per-datum log likelihood $L(\theta, \pi)$ will typically depend on
both the filter $\theta$ and the noise function $\pi$ in a correlated manner (left panel). Values of a
schematic $L(\theta, \pi)$ are illustrated in gray, with darker shades indicating larger likelihood.
If the correct noise function $\pi^*$ is assumed (solid line), maximizing $L(\theta, \pi^*)$ will yield
the correct filter $\theta^*$ (filled dot). However, if an incorrect noise function $\pi'$ is assumed
(dashed line), maximizing $L(\theta, \pi')$ will typically lead to an incorrect filter $\theta'$ (open dot).

Figure 2: Venn diagram illustrating filter sets maximizing different DPI-satisfying
measures. In general, different DPI-satisfying dependence measures, e.g. mutual information
I and some other measure $\mathcal{D}$, will be maximized by different sets of filters,
respectively represented here by $\Theta_I$ and $\Theta_\mathcal{D}$. $\Theta_{\rm DPI}$ is the intersection of the optimal
sets of all such DPI-satisfying measures. Mutual information has the important property
that $\Theta_I = \Theta_{\rm DPI}$ whenever $\theta^* \in \Theta$; this is not true of all DPI-satisfying measures.



Figure 3: A linear-nonlinear filter modeling the biophysics of transcriptional regulation
at the Escherichia coli lac promoter. (A) The biophysical model inferred by Kinney
et al. (2010) from Sort-Seq data. Each signal S is a 75 bp DNA sequence differing
from the wildtype lac promoter by $\sim 9$ randomly scattered substitution mutations. Q
and P denote the sequence-dependent binding energies of the proteins CRP and RNAP
to their respective sites on this sequence S; both Q and P were modeled as linear filters
of S. $\gamma$ is a sequence-independent interaction energy between CRP and RNAP. The resulting
transcription rate T, of which the Sort-Seq assay produces noisy measurements
M, is assumed to depend on Q, P, and $\gamma$ in a specific nonlinear manner dictated by the
hypothesized biophysical mechanism (Eq. 20; all energies are in units of $k_B T$). (B) The
linear filter Q is defined by parameters $\theta_Q^{bl}$ and $\theta_Q^0$ via Eq. 19. Inferring these parameters
by maximizing the mutual information $I[Q; M]$ determines $\theta_Q^{bl}$ up to an unknown scale
and leaves $\theta_Q^0$ undetermined. (C) Analogous results are obtained for the parameters $\theta_P^{bl}$
and $\theta_P^0$ when $I[P; M]$ is maximized. (D) Because of the inherent nonlinearity in Eq. 20
(right-hand side), maximizing $I[T; M]$ breaks diffeomorphic modes, fixing the values
of $\theta_Q^{bl}$, $\theta_P^{bl}$, and $\theta_Q^0$ in units of $k_B T$. The parameter $\theta_P^0$ remains undetermined.

Figure 4: Schematic illustration of constraints placed on diffeomorphic and nondiffeomorphic
modes by different objective functions. The dot in each panel represents
the correct filter $\theta^*$; shades of gray represent the posterior distribution $p(\theta|\text{data})$. (A,B)
Likelihood (Eq. 2) places tight constraints (scaling as $N^{-1/2}$ as $N \to \infty$) along both
diffeomorphic and nondiffeomorphic modes. (A) $\theta^*$ will typically lie within error bars
if the correct noise function $\pi^*$ is used. (B) However, if an incorrect noise function $\pi'$ is
used, $\theta^*$ will generally violate inferred constraints along both diffeomorphic and nondiffeomorphic
modes. (C) Marginal likelihood (Eq. 7) computed using a sufficiently weak
prior $p(\pi)$ will place tight constraints on nondiffeomorphic modes and weak constraints
(scaling as $N^0$ as $N \to \infty$) along diffeomorphic modes. (D) Mutual information (Eq. 3)
places tight constraints on nondiffeomorphic modes but provides no constraints whatsoever
on diffeomorphic modes.